Problem Statement¶
This report serves as my final project for the CQF program.
The objective is to develop a model that predicts daily upward movements in the stock price of Adobe (ticker: ADBE), using Long Short-Term Memory (LSTM) networks. The choice of Adobe pays tribute to my early career as a graphic designer, during which Adobe tools accompanied me through one of the most creative periods of my life.
The target variable is a binomial classification labeled as $[0, 1]$, where $1$ indicates a positive price movement. The model will be trained, evaluated, and tested on a five-year dataset spanning from January 1, 2020 to July 1, 2025.
The report highlights the project workflow, covering decision flows, mathematical rationale, and analytical insights. For the full executable code, please refer to DL Tuo Li CODE/DL Tuo Li CODE.ipynb.
Content¶
1. Preparation¶
2. Feature Engineering¶
3. Exploratory Data Analysis (EDA)¶
- 3.1 Structural evaluation
- 3.2 SHAP analysis and feature relationship exploration
- 3.3 Analyze multicollinearity and reduce dimensionality
4. Model Building¶
- 4.1 Prepare dataset
- 4.2 Baseline model - 2 layer LSTM model without dropout
- 4.3 Variant model A - 2 layer LSTM model with dropout
- 4.4 Variant model B - 3 layer LSTM model without dropout
- 4.5 Variant model C - 3 layer LSTM model with dropout
- 4.6 Review of all the models
5. Trading strategy with backtesting¶
6. Conclusion¶
1. Preparation¶
In this project, all data is presented in tabular format.
The main dataset, referred to as df, contains Adobe trading data from January 1, 2020 to July 1, 2025, sourced directly from MacroTrends.
Key characteristics of the dataset:
Index: date (one row per trading day)
Total rows: 1381 (approx. 5.5 years of trading data)
Columns: 6
- open, high, low, close: float values representing price data
- volume: integer representing trading volume
- weekday: string indicating the trading day (Mon–Fri)
Data quality: no missing values
Dataset preview (first 5 rows):
| date | open | high | low | close | volume | weekday |
|---|---|---|---|---|---|---|
| 2020-01-02 | 330.000 | 334.480 | 329.170 | 334.430 | 1990496 | Thu |
| 2020-01-03 | 329.170 | 332.980 | 328.690 | 331.810 | 1579368 | Fri |
| 2020-01-06 | 328.290 | 333.910 | 328.190 | 333.710 | 1875122 | Mon |
| 2020-01-07 | 334.150 | 334.790 | 332.305 | 333.390 | 2507261 | Tue |
| 2020-01-08 | 333.810 | 339.230 | 333.400 | 337.870 | 2248531 | Wed |
Statistical summary of the numerical columns:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| open | 1381 | 466.742 | 93.730 | 277.800 | 384.970 | 470.480 | 526.035 | 696.275 |
| high | 1381 | 472.594 | 93.930 | 279.590 | 390.130 | 475.867 | 533.510 | 699.540 |
| low | 1381 | 460.621 | 93.216 | 255.131 | 380.945 | 462.480 | 519.560 | 678.910 |
| close | 1381 | 466.773 | 93.617 | 275.200 | 385.710 | 469.730 | 526.940 | 688.370 |
| volume | 1381 | 3149150.357 | 1864608.743 | 589182 | 2104030 | 2660097 | 3582532 | 27840211 |
2. Feature Engineering¶
In this chapter, I define the project's target variable and construct a broad set of candidate features using several techniques.
2.1 Define the target¶
2.1.1 Decide the threshold¶
Since this project focuses on predicting the direction of daily returns as a binomial classification $[0, 1]$, it's essential to establish a threshold for distinguishing between positive and negative returns.
Before determining a suitable classification threshold, we will explore the data. The close price is used to represent the daily price of Adobe. Let's observe how the close column has evolved over the past five years.
Over the past five years, Adobe's price has shown significant volatility with a slight upward drift.
To analyze this further, I will create a return column and examine the distribution of the daily return.
Main dataset with the return column (first 5 rows):
| date | open | high | low | close | volume | weekday | return |
|---|---|---|---|---|---|---|---|
| 2020-01-02 | 330.000 | 334.480 | 329.170 | 334.430 | 1990496 | Thu | NaN |
| 2020-01-03 | 329.170 | 332.980 | 328.690 | 331.810 | 1579368 | Fri | -0.008 |
| 2020-01-06 | 328.290 | 333.910 | 328.190 | 333.710 | 1875122 | Mon | 0.006 |
| 2020-01-07 | 334.150 | 334.790 | 332.305 | 333.390 | 2507261 | Tue | -0.001 |
| 2020-01-08 | 333.810 | 339.230 | 333.400 | 337.870 | 2248531 | Wed | 0.013 |
The distribution histogram of the returns:
As illustrated in the histogram, Adobe's daily returns exhibit a roughly normal distribution with a slight positive skew.
To optimize classification performance, I set the threshold at $0.2\%$ for the following reasons:
It approximately splits the dataset evenly, minimizing class imbalance and helping stabilize model training.
The primary objective is to generate meaningful predictions for positive market moves (class 1). A positive threshold ensures that class 1 predictions are significant if the model performs well, while also providing room to account for transaction costs in reality.
Under this target definition:
A value of $1$ is assigned when the next day's closing price is at least $0.2\%$ higher than the current day's close, indicating a potential buying opportunity. Otherwise, no action is taken.
Returns below $0.2\%$ are labeled as $0$.
Main dataset (first 5 rows) with the newly created target column which indicates whether the next day's return is greater than $0.2\%$ or not:
| date | open | high | low | close | volume | weekday | return | target |
|---|---|---|---|---|---|---|---|---|
| 2020-01-02 | 330.000 | 334.480 | 329.170 | 334.430 | 1990496 | Thu | NaN | 0 |
| 2020-01-03 | 329.170 | 332.980 | 328.690 | 331.810 | 1579368 | Fri | -0.008 | 1 |
| 2020-01-06 | 328.290 | 333.910 | 328.190 | 333.710 | 1875122 | Mon | 0.006 | 0 |
| 2020-01-07 | 334.150 | 334.790 | 332.305 | 333.390 | 2507261 | Tue | -0.001 | 1 |
| 2020-01-08 | 333.810 | 339.230 | 333.400 | 337.870 | 2248531 | Wed | 0.013 | 1 |
2.1.2 Check class imbalance¶
With a threshold of $0.2\%$, in the target column, class 0 ($715$ samples) represents $51.77\%$ of the data, while class 1 ($666$ samples) represents $48.23\%$.
The data is nearly balanced, so we do not need to be concerned about target imbalance.
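The labeling and balance check described above can be sketched in pandas. This is a minimal illustration on the five preview rows; in the notebook the same logic runs over the full dataset.

```python
import pandas as pd

# Minimal sketch of the labeling step on the preview rows, assuming the
# main dataset `df` is indexed by date with a float `close` column.
df = pd.DataFrame(
    {"close": [334.43, 331.81, 333.71, 333.39, 337.87]},
    index=pd.to_datetime(
        ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07", "2020-01-08"]
    ),
)

THRESHOLD = 0.002  # the 0.2% cutoff separating class 1 from class 0

# Daily return: percentage change of the close price (first row is NaN).
df["return"] = df["close"].pct_change()

# target = 1 when the NEXT day's return exceeds the threshold, else 0.
# (On the last preview row the next day is unknown, so it defaults to 0 here.)
df["target"] = (df["return"].shift(-1) > THRESHOLD).astype(int)

# Class balance check.
print(df["target"].value_counts(normalize=True))
```

Shifting the return backwards (`shift(-1)`) is what makes the label forward-looking: each row is tagged with tomorrow's outcome, never its own.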
2.2 Generate features¶
2.2.1 Features from trading data¶
2.2.1.1 Weekday transformation¶
Weekday patterns can influence stock price behavior, so this temporal feature is included in our model.
Instead of using traditional one-hot encoding, which expands weekdays into five separate columns, we apply a trigonometric transformation using sine and cosine functions. This method captures the cyclical nature of weekdays, particularly the continuity between Friday and Monday, while adding only two numerical columns. This improves efficiency and preserves temporal structure.
The formulas for the new features dsin and dcos are:
$$\text{dsin} = \sin\!\left(\frac{2\pi \cdot \text{num}}{7}\right)$$
$$\text{dcos} = \cos\!\left(\frac{2\pi \cdot \text{num}}{7}\right)$$
Where num is the numerical representation of the weekday:
$1$ : Monday
$2$ : Tuesday
$3$ : Wednesday
$4$ : Thursday
$5$ : Friday
This transformation ensures the model understands the cyclical flow of time without inflating the feature space.
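A short sketch of this encoding, using the weekday-to-number map and the 7-day cycle from the formulas above:

```python
import numpy as np
import pandas as pd

# Cyclical weekday encoding: Mon..Fri map to 1..5, and dividing by 7
# places the trading days on a weekly circle.
weekday_num = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5}

s = pd.Series(["Thu", "Fri", "Mon", "Tue", "Wed"])  # preview weekdays
num = s.map(weekday_num)

dsin = np.sin(2 * np.pi * num / 7)
dcos = np.cos(2 * np.pi * num / 7)
print(dsin.round(3).tolist())  # [-0.434, -0.975, 0.782, 0.975, 0.434]
print(dcos.round(3).tolist())  # [-0.901, -0.223, 0.623, -0.223, -0.901]
```

The printed values match the dsin/dcos columns in the preview table below, confirming the two-column encoding replaces five one-hot columns without losing the day identity.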
Below is the updated dataset (first 5 rows) with the 2 new features. The original weekday column is removed.
| date | open | high | low | close | volume | return | target | dsin | dcos |
|---|---|---|---|---|---|---|---|---|---|
| 2020-01-02 | 330.000 | 334.480 | 329.170 | 334.430 | 1990496 | NaN | 0 | -0.434 | -0.901 |
| 2020-01-03 | 329.170 | 332.980 | 328.690 | 331.810 | 1579368 | -0.008 | 1 | -0.975 | -0.223 |
| 2020-01-06 | 328.290 | 333.910 | 328.190 | 333.710 | 1875122 | 0.006 | 0 | 0.782 | 0.623 |
| 2020-01-07 | 334.150 | 334.790 | 332.305 | 333.390 | 2507261 | -0.001 | 1 | 0.975 | -0.223 |
| 2020-01-08 | 333.810 | 339.230 | 333.400 | 337.870 | 2248531 | 0.013 | 1 | 0.434 | -0.901 |
2.2.1.2 Technical features from trading data¶
Then, I will leverage the add_all_ta_features function from the ta library to automatically generate a wide range of technical indicators from the OHLCV (open, high, low, close, volume) data. This step efficiently enriches the dataset with 80+ indicators across the following categories:
Volume: ADI, OBV, CMF, MFI, NVI, etc.
Volatility: ATR, Bollinger Bands, Keltner Channels, Donchian Channels, etc.
Trend: SMA, EMA, MACD, ADX, Ichimoku, Parabolic SAR, Aroon, etc.
Momentum: RSI, Stochastic RSI, ROC, PPO, TSI, etc.
Others: daily returns, log returns, cumulative returns, etc.
After generation, further feature refinement steps are applied:
1. Remove redundant return feature
Since the daily return is already present in the dataset, the duplicated others_dr feature generated by add_all_ta_features is dropped to avoid redundancy.
2. Consolidate sparse PSAR trend signals
Among the trend indicators generated by add_all_ta_features, trend_psar_up and trend_psar_down exhibit a high number of missing values due to the mechanics of the Parabolic SAR (PSAR), which trails price movements using a dynamic stop level. The PSAR is computed as:
$$ PSAR_t = PSAR_{t−1} + AF \cdot (EP_{t−1} − PSAR_{t−1}) $$ Where:
$AF$ is the acceleration factor, starting at $0.02$ and capped at $0.2$
$EP$ is the extreme point (highest high during an uptrend or lowest low during a downtrend)
Explanation of the features:
trend_psar_up: contains PSAR values during uptrends (dots below price), otherwise NaN
trend_psar_down: contains PSAR values during downtrends (dots above price), otherwise NaN
To simplify, I create a new feature column psar_trend with directional encoding that consolidates the information from these two columns:
1 for uptrend
-1 for downtrend
0 for neutral
After generating psar_trend, the original two PSAR columns are removed to reduce sparsity and simplify the feature set.
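The consolidation step above can be sketched as follows; using np.select for the encoding is an illustrative choice, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Sketch of collapsing the two sparse PSAR columns (as produced by the
# ta library) into one directional feature. Values here are synthetic.
df = pd.DataFrame(
    {
        "trend_psar_up": [330.1, 331.0, np.nan, np.nan, 333.2],
        "trend_psar_down": [np.nan, np.nan, 340.5, 339.8, np.nan],
    }
)

# 1 = uptrend (PSAR below price), -1 = downtrend, 0 = neutral.
df["psar_trend"] = np.select(
    [df["trend_psar_up"].notna(), df["trend_psar_down"].notna()],
    [1, -1],
    default=0,
)

# The sparse source columns are then dropped.
df = df.drop(columns=["trend_psar_up", "trend_psar_down"])
print(df["psar_trend"].tolist())  # [1, 1, -1, -1, 1]
```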
3. Transform drift/non-stationary price-reflective features
Several generated features directly reflect price movement, inheriting drift and non-stationarity:
trend_sma_fast, trend_sma_slow, trend_ema_fast, trend_ema_slow
trend_ichimoku_conv, trend_ichimoku_base, trend_ichimoku_a, trend_ichimoku_b
trend_visual_ichimoku_a, trend_visual_ichimoku_b
Using standard or min-max scalers on them would cause data leakage, because the scaling parameters would be influenced by future price levels. To address this, these features are converted into relative ratios by dividing them by the day's closing price. For example:
$$\text{trend\_sma\_fast\_div\_close} = \frac{\text{trend\_sma\_fast}}{\text{close}}$$
This transformation reduces drift, normalizes scale, and preserves interpretability without leakage. The original columns are then removed and replaced by the ratio versions.
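A minimal sketch of the ratio transformation on a toy two-row frame; in the project, drift_cols holds the ten price-level columns listed above.

```python
import pandas as pd

# Sketch: each price-level trend feature is divided by the same day's
# close, then the original column is dropped. Values are synthetic.
df = pd.DataFrame({"close": [100.0, 110.0], "trend_sma_fast": [98.0, 121.0]})

drift_cols = ["trend_sma_fast"]  # in the project this list has 10 columns
for col in drift_cols:
    df[f"{col}_div_close"] = df[col] / df["close"]
df = df.drop(columns=drift_cols)

print(df["trend_sma_fast_div_close"].tolist())  # [0.98, 1.1]
```

A ratio below 1 means the moving average sits under the current close (price above trend), above 1 means the opposite, so the scale is comparable across the whole five-year span.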
4. Add more features for daily variance
Finally, I manually include two basic range-based features to capture intraday price movement:
$$\text{h-l} = \text{high} - \text{low}$$
$$\text{o-c} = \text{open} - \text{close}$$
After applying feature enrichment from add_all_ta_features and the refinement steps above, we now have 95 columns in the main dataset:
Original OHLCV data: open, high, low, close, volume
Return and target: return, target
Trigonometric weekday features: dsin, dcos
Volume features: volume_adi, volume_obv, volume_cmf, volume_fi, volume_em, volume_sma_em, volume_vpt, volume_vwap, volume_mfi, volume_nvi
Volatility features: volatility_bbm, volatility_bbh, volatility_bbl, volatility_bbw, volatility_bbp, volatility_bbhi, volatility_bbli, volatility_kcc, volatility_kch, volatility_kcl, volatility_kcw, volatility_kcp, volatility_kchi, volatility_kcli, volatility_dcl, volatility_dch, volatility_dcm, volatility_dcw, volatility_dcp, volatility_atr, volatility_ui
Trend features: trend_macd, trend_macd_signal, trend_macd_diff, trend_vortex_ind_pos, trend_vortex_ind_neg, trend_vortex_ind_diff, trend_trix, trend_mass_index, trend_dpo, trend_kst, trend_kst_sig, trend_kst_diff, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_cci, trend_aroon_up, trend_aroon_down, trend_aroon_ind, trend_psar_up_indicator, trend_psar_down_indicator, psar_trend
Transformed trend features (drift removed): trend_sma_fast_div_close, trend_sma_slow_div_close, trend_ema_fast_div_close, trend_ema_slow_div_close, trend_ichimoku_conv_div_close, trend_ichimoku_base_div_close, trend_ichimoku_a_div_close, trend_ichimoku_b_div_close, trend_visual_ichimoku_a_div_close, trend_visual_ichimoku_b_div_close
Momentum features: momentum_rsi, momentum_stoch_rsi, momentum_stoch_rsi_k, momentum_stoch_rsi_d, momentum_tsi, momentum_uo, momentum_stoch, momentum_stoch_signal, momentum_wr, momentum_ao, momentum_roc, momentum_ppo, momentum_ppo_signal, momentum_ppo_hist, momentum_pvo, momentum_pvo_signal, momentum_pvo_hist, momentum_kama
Other return-related features: others_dlr, others_cr
Daily variance features: h-l, o-c
All columns contain numeric values, either as floats or integers.
2.2.2 Features from other resources¶
Stock prices can be influenced by a variety of factors beyond the trading data itself, such as investor sentiment reflected in news coverage, CDS spreads, and dividend history.
After thorough research, I found that Adobe’s CDS spread data is not publicly available, and the company has not issued any dividends since 2005. As a result, this section will focus on extracting sentiment signals from Adobe-related news headlines published between 1 Jan 2020 and 1 July 2025.
The news data was downloaded and processed in the DL Tuo Li CODE/DL Tuo Li Appendix_news_process.ipynb notebook. The processed news dataset is indexed by date, spanning from January 1, 2020 to July 1, 2025, aligning with the main dataset. It contains two columns that quantify news-driven signals for each calendar day:
news_sentiment_score: indicates whether the news titles express a positive or negative sentiment that could impact Adobe's price. The value ranges from $-1$ to $1$, where $-1$ means highly negative, $0$ means neutral, and $1$ means highly positive.
news_emotion_intensity: measures how strong the emotion in the titles is. The value ranges from $0$ to $1$, where $0$ means extremely low and $1$ means extremely high.
These metrics were generated by a large language model (LLM) that automatically analyzed each day's Adobe news titles. If no relevant news was published on a particular day, that date is omitted from the index.
Here are the first 5 rows of the news dataset:
| date | news_sentiment_score | news_emotion_intensity |
|---|---|---|
| 2020-01-01 | 0.000 | 0.000 |
| 2020-01-03 | 0.450 | 0.580 |
| 2020-01-06 | 0.000 | 0.000 |
| 2020-01-09 | 0.150 | 0.400 |
| 2020-01-10 | 0.000 | 0.000 |
Then I merge it with the main dataset, which now has 97 columns.
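The merge can be sketched as a left join on the date index. Filling no-news trading days with 0 (neutral sentiment, zero intensity) is an assumption for illustration, not necessarily the notebook's exact choice; the sample values are synthetic.

```python
import pandas as pd

# Hypothetical sketch of merging the daily news signals into the main
# dataset, keeping only trading days.
main = pd.DataFrame(
    {"close": [334.43, 331.81, 333.71]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)
news = pd.DataFrame(
    {"news_sentiment_score": [0.45], "news_emotion_intensity": [0.58]},
    index=pd.to_datetime(["2020-01-03"]),
)

# Left join on the date index: trading days without news get NaN, which
# is then filled with 0 (assumed neutral signal).
merged = main.join(news, how="left")
news_cols = ["news_sentiment_score", "news_emotion_intensity"]
merged[news_cols] = merged[news_cols].fillna(0.0)
print(merged)
```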
2.2.3 Features from macro environment¶
2.2.3.1 Features from QQQ¶
Invesco QQQ (ticker: QQQ) is an exchange-traded fund (ETF) that tracks the Nasdaq-100 Index, which includes 100 of the largest non-financial companies listed on the Nasdaq, such as Apple, Amazon, Adobe and Nvidia.
It is heavily weighted toward large-cap technology stocks, and often viewed as a proxy for the tech sector’s overall health. QQQ's price reflects aggregated investor sentiment toward high-growth, large-cap tech stocks.
QQQ can act as a sentiment barometer for Adobe’s ecosystem. Including QQQ data will improve the model’s ability to predict uptrend probabilities by incorporating macro and sector signals.
The QQQ trading data was also sourced from MacroTrends, with standard OHLCV information.
Below is the preview of the QQQ dataset with first 5 rows:
| date | open | high | low | close | volume | weekday |
|---|---|---|---|---|---|---|
| 2020-01-02 | 207.096 | 208.796 | 206.690 | 208.796 | 29958247 | Thu |
| 2020-01-03 | 206.023 | 208.129 | 206.014 | 206.883 | 26594637 | Fri |
| 2020-01-06 | 205.251 | 208.245 | 205.009 | 208.216 | 20986764 | Mon |
| 2020-01-07 | 208.293 | 208.776 | 207.530 | 208.187 | 22333269 | Tue |
| 2020-01-08 | 208.129 | 210.708 | 207.830 | 209.752 | 25562588 | Wed |
Statistical summary of the numerical columns:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| open | 1381 | 353.366 | 89.170 | 165.448 | 287.310 | 344.309 | 422.327 | 551.260 |
| high | 1381 | 356.212 | 89.304 | 168.633 | 289.908 | 346.220 | 426.196 | 552.800 |
| low | 1381 | 350.316 | 88.836 | 159.650 | 283.839 | 341.277 | 419.791 | 549.010 |
| close | 1381 | 353.462 | 89.109 | 163.532 | 286.743 | 343.896 | 422.211 | 551.640 |
| volume | 1381 | 48326455.122 | 21716833.598 | 15225092 | 33335184 | 44701851 | 57932013 | 194966806 |
In order to better leverage the information provided by QQQ trading data, we will also need to perform some feature engineering with this dataset.
First, generate a new feature qqq_adobe_corr_20 that captures the correlation between QQQ and Adobe daily returns. By calculating a 20-day rolling correlation, this feature reflects sector-level alignment and helps quantify how closely Adobe’s price movements track broader technology trends represented by QQQ.
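The rolling-correlation feature can be sketched with pandas; the two return series here are synthetic stand-ins for the Adobe and QQQ daily returns, assumed to share the same date index.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns for illustration: qqq_ret is built to be
# positively correlated with adbe_ret, mimicking sector co-movement.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=60)
adbe_ret = pd.Series(rng.normal(0, 0.02, 60), index=dates)
qqq_ret = 0.7 * adbe_ret + pd.Series(rng.normal(0, 0.01, 60), index=dates)

# 20-day rolling correlation between the two return series.
qqq_adobe_corr_20 = adbe_ret.rolling(window=20).corr(qqq_ret)

# The first 19 values are NaN; afterwards each value is the correlation
# over the trailing 20 trading days.
print(qqq_adobe_corr_20.tail())
```

Because the window is trailing, the feature at day $t$ uses only information up to $t$, so it introduces no look-ahead leakage.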
Then, add technical features from QQQ's trading data. As QQQ's data plays only a supporting role in this project, I won't add the full set of technical features. Instead, I will focus on several key ones:
10-day SMA divided by the close price, and 10-day EMA divided by the close price
ATR, BBANDS, RSI and MACD
H-L and O-C
Finally, select the relevant QQQ features and merge them into the main dataset.
Below is a preview of the selected QQQ features with last 5 rows. For ease of display, it is split into two parts.
| date | qqq_volume | qqq_adobe_corr_20 | qqq_sma_10_div_close | qqq_ema_10_div_close |
|---|---|---|---|---|
| 2025-06-25 | 44804200 | 0.643 | 0.983 | 0.984 |
| 2025-06-26 | 43811400 | 0.635 | 0.977 | 0.980 |
| 2025-06-27 | 57577100 | 0.638 | 0.976 | 0.981 |
| 2025-06-30 | 45548700 | 0.661 | 0.974 | 0.979 |
| 2025-07-01 | 56166700 | 0.638 | 0.985 | 0.990 |
| date | qqq_atr | qqq_bbands_l | qqq_bbands_m | qqq_bbands_u | qqq_rsi | qqq_macd | qqq_h-l | qqq_o-c |
|---|---|---|---|---|---|---|---|---|
| 2025-06-25 | 6.799 | 520.018 | 533.445 | 546.872 | 62.291 | 8.254 | 3.930 | 0.900 |
| 2025-06-26 | 6.452 | 521.058 | 537.010 | 552.962 | 70.425 | 8.802 | 5.150 | -2.870 |
| 2025-06-27 | 6.341 | 528.506 | 541.380 | 554.254 | 68.525 | 9.280 | 5.450 | -0.830 |
| 2025-06-30 | 6.439 | 535.559 | 545.378 | 555.197 | 70.158 | 9.832 | 3.790 | -0.380 |
| 2025-07-01 | 6.509 | 539.252 | 546.820 | 554.388 | 62.261 | 9.781 | 6.050 | 2.740 |
With these 12 newly included QQQ features, the main dataset now has 109 columns.
2.2.3.2 Features from macro economy¶
Next, the Federal Funds Effective Rate very likely influences Adobe's share price movements, as the technology sector is highly sensitive to borrowing costs and liquidity, both of which are directly affected by rate fluctuations.
I sourced this monthly series from the Federal Reserve Bank of St. Louis. Below is a preview of the first 5 rows of the data.
| observation_date | FEDFUNDS |
|---|---|
| 2020-01-01 | 1.550 |
| 2020-02-01 | 1.580 |
| 2020-03-01 | 0.650 |
| 2020-04-01 | 0.050 |
| 2020-05-01 | 0.050 |
As the federal rate data is on a monthly basis, we need to map it to our main dataset using a forward-filling technique, ensuring that in the merged dataset, each day's federal rate correctly aligns with the corresponding monthly value.
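One way to implement this forward mapping is pandas merge_asof, sketched here on the preview values; choosing merge_asof over reindex-and-ffill is an illustrative choice.

```python
import pandas as pd

# Monthly FEDFUNDS observations (from the preview above).
fed = pd.DataFrame(
    {
        "observation_date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
        "FEDFUNDS": [1.55, 1.58, 0.65],
    }
)

# A few sample trading days to map the monthly rate onto.
daily = pd.DataFrame(
    {"date": pd.to_datetime(["2020-01-02", "2020-01-31", "2020-02-03", "2020-03-02"])}
)

# direction="backward" picks, for each trading day, the most recent
# monthly observation on or before that day (i.e., a forward fill).
# Both frames must be sorted on their keys.
merged = pd.merge_asof(
    daily, fed, left_on="date", right_on="observation_date", direction="backward"
)
print(merged[["date", "FEDFUNDS"]])
```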
After incorporating the federal rate data, the feature generation is completed.
We now have 110 columns in the main dataset:
Original OHLCV data: open, high, low, close, volume
Return and target: return, target
Trigonometric weekday features: dsin, dcos
Volume features: volume_adi, volume_obv, volume_cmf, volume_fi, volume_em, volume_sma_em, volume_vpt, volume_vwap, volume_mfi, volume_nvi
Volatility features: volatility_bbm, volatility_bbh, volatility_bbl, volatility_bbw, volatility_bbp, volatility_bbhi, volatility_bbli, volatility_kcc, volatility_kch, volatility_kcl, volatility_kcw, volatility_kcp, volatility_kchi, volatility_kcli, volatility_dcl, volatility_dch, volatility_dcm, volatility_dcw, volatility_dcp, volatility_atr, volatility_ui
Trend features: trend_macd, trend_macd_signal, trend_macd_diff, trend_vortex_ind_pos, trend_vortex_ind_neg, trend_vortex_ind_diff, trend_trix, trend_mass_index, trend_dpo, trend_kst, trend_kst_sig, trend_kst_diff, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_cci, trend_aroon_up, trend_aroon_down, trend_aroon_ind, trend_psar_up_indicator, trend_psar_down_indicator, psar_trend
Transformed trend features (drift removed): trend_sma_fast_div_close, trend_sma_slow_div_close, trend_ema_fast_div_close, trend_ema_slow_div_close, trend_ichimoku_conv_div_close, trend_ichimoku_base_div_close, trend_ichimoku_a_div_close, trend_ichimoku_b_div_close, trend_visual_ichimoku_a_div_close, trend_visual_ichimoku_b_div_close
Momentum features: momentum_rsi, momentum_stoch_rsi, momentum_stoch_rsi_k, momentum_stoch_rsi_d, momentum_tsi, momentum_uo, momentum_stoch, momentum_stoch_signal, momentum_wr, momentum_ao, momentum_roc, momentum_ppo, momentum_ppo_signal, momentum_ppo_hist, momentum_pvo, momentum_pvo_signal, momentum_pvo_hist, momentum_kama
Other return-related features: others_dlr, others_cr
Daily variance features: h-l, o-c
News sentiment features: news_sentiment_score, news_emotion_intensity
QQQ-related features: qqq_volume, qqq_adobe_corr_20, qqq_sma_10_div_close, qqq_ema_10_div_close, qqq_atr, qqq_bbands_l, qqq_bbands_m, qqq_bbands_u, qqq_rsi, qqq_macd, qqq_h-l, qqq_o-c
Federal funds rate: FEDFUNDS
All columns contain numeric values, either as floats or integers.
3. Exploratory Data Analysis (EDA)¶
3.1 Structural evaluation¶
3.1.1 Handle the missing values¶
All data were initially sourced without any missing values. However, during feature generation, $NaN$ values were introduced due to the nature of certain calculation formulas. These missing values typically occur in the initial rows of the dataset, where some formulas couldn’t compute feature values.
To maintain data integrity, we removed all rows containing $NaN$ values. Prior to this cleanup, the dataset comprised 1381 rows; after removal, it contains 1310 rows. As a result, the dataset now begins on April 15, 2020.
3.1.2 Create the feature X and the target y¶
To prepare for exploratory data analysis and future deep learning model training, we begin by constructing the feature dataset X and target dataset y.
X contains all columns from the main dataset, excluding the non-stationary price fields (open, high, low, close) and the target variable. It includes 105 columns in total.
y corresponds to the target column and represents the prediction objective.
Both X and y consist of 1,310 rows.
Since X encompasses all engineered features, it will be the primary focus in the upcoming exploratory analysis.
3.1.3 Explore feature distribution and analyze outliers¶
3.1.3.1 Visualize the features¶
This section focuses on analyzing feature distributions to help with suitable scaler selection.
I’ll begin by visualizing each feature in X using histograms and KDE plots to guide the categorization and scaling strategy:
3.1.3.2 Group the features¶
Guided by the visualization of the 105 features, I will group them into 3 categories. Each group is paired with an appropriate scaling method to ensure consistent preprocessing:
Bounded distribution: Indicators constrained to a fixed range, showing skewed, bimodal, or other patterns. Scaled using MinMaxScaler.
Normal distribution: Features with symmetric spread and bell-shaped histograms. Best suited for StandardScaler.
Skewed / long-tailed / clustered distribution: Features with heavy skew, long tails, or clustered spikes. These are sensitive to outliers and scaled using RobustScaler.
After learning the features and observing their distributions, I can reach some initial grouping decisions:
Based on the definitions and with reference to the plots, I identified clear candidates for the bounded distribution group.
Bounded distribution:
dsin, dcos, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_aroon_up, trend_aroon_down, trend_aroon_ind, trend_psar_up_indicator, trend_psar_down_indicator, momentum_rsi, momentum_stoch_rsi, momentum_stoch_rsi_k, momentum_stoch_rsi_d, momentum_tsi, momentum_uo, momentum_stoch, momentum_stoch_signal, momentum_wr, psar_trend, news_sentiment_score, news_emotion_intensity, qqq_rsi, FEDFUNDS
From the plots, features with obvious skew, long tails, or clusters can be spotted as well.
Skewed / long-tailed / clustered distribution:
volume, volume_adi, volume_obv, volume_cmf, volume_sma_em, volume_vpt, volume_vwap, volume_mfi, volume_nvi, volatility_bbm, volatility_bbh, volatility_bbl, volatility_bbw, volatility_bbp, volatility_bbhi, volatility_bbli, volatility_kcc, volatility_kch, volatility_kcl, volatility_kcw, volatility_kchi, volatility_kcli, volatility_dcl, volatility_dch, volatility_dcm, volatility_dcw, volatility_dcp, volatility_atr, volatility_ui, trend_vortex_ind_pos, trend_vortex_ind_neg, trend_vortex_ind_diff, trend_mass_index, trend_cci, trend_ichimoku_conv_div_close, trend_ichimoku_base_div_close, trend_ichimoku_a_div_close, trend_ichimoku_b_div_close, trend_visual_ichimoku_a_div_close, trend_visual_ichimoku_b_div_close, momentum_ppo, momentum_ppo_signal, momentum_pvo, momentum_pvo_signal, momentum_kama, others_cr, h-l, qqq_volume, qqq_adobe_corr_20, qqq_atr, qqq_bbands_l, qqq_bbands_m, qqq_bbands_u, qqq_macd, qqq_h-l
Outlier analysis will be used to finalize the categorization of the rest. These features appear approximately normal, but it is unclear whether their potential outliers may cause them to behave like long-tailed distributions.
Features to be examined with outlier analysis:
return,volume_fi,volume_em,volatility_kcp,trend_macd,trend_macd_signal,trend_macd_diff,trend_trix,trend_dpo,trend_kst,trend_kst_sig,trend_kst_diff,trend_sma_fast_div_close,trend_sma_slow_div_close,trend_ema_fast_div_close,trend_ema_slow_div_close,momentum_ao,momentum_roc,momentum_ppo_hist,momentum_pvo_hist,others_dlr,o-c,qqq_sma_10_div_close,qqq_ema_10_div_close,qqq_o-c
Outlier analysis:
For the 25 features that need outlier analysis, I will apply two complementary metrics: the skewness coefficient and the Interquartile Range (IQR).
Skewness coefficient
It quantifies the asymmetry of the distribution:
$0$ : perfectly symmetric distribution
Positive : right-skewed
Negative : left-skewed
Interpretation:
$|\text{Skewness}| < 0.5$ : fairly symmetric, typically fine for StandardScaler
$0.5 \le |\text{Skewness}| \le 1$ : moderately skewed
$|\text{Skewness}| > 1$ : heavily skewed, often requiring RobustScaler or transformations to reduce skew
Interquartile Range (IQR)
This method identifies statistical outliers by examining how far values deviate from the middle 50% of the data, offering a robust, non-parametric approach that does not assume normality:
$IQR$ is the gap between the 75th percentile (Q3) and the 25th percentile (Q1).
Values falling outside the range $[Q1 - 1.5 \times IQR, Q3 + 1.5 \times IQR]$ are considered outliers.
If outliers account for less than 5% of the data, we consider the feature suitable for normal-distribution treatment.
Decision logic
A feature will be categorized as normally distributed if both conditions are met:
$|\text{Skewness}| < 0.5$
$\text{Outlier ratio} < 5\%$
Otherwise, it will be classified as skewed / long-tailed and scaled accordingly.
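The decision rule can be sketched as a small helper using the thresholds defined above; the two synthetic series stand in for a well-behaved and a heavy-tailed feature.

```python
import numpy as np
import pandas as pd

def categorize(s: pd.Series) -> str:
    """Classify a feature as Normal only if |skewness| < 0.5 AND the
    IQR-based outlier ratio is below 5%; otherwise skewed/long-tailed."""
    skew = s.skew()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outlier_ratio = ((s < lo) | (s > hi)).mean()
    if abs(skew) < 0.5 and outlier_ratio < 0.05:
        return "Normal"
    return "Skewed/Long-tailed"

# Synthetic examples, sized like the 1310-row dataset.
rng = np.random.default_rng(42)
normal_feat = pd.Series(rng.normal(0, 1, 1310))
skewed_feat = pd.Series(rng.lognormal(0, 1, 1310))
print(categorize(normal_feat), categorize(skewed_feat))
```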
We will use this decision to update the above feature lists. Meanwhile, to support this process, for the 25 features in the outlier analysis, we generate:
Histogram + KDE plots for shape and modality
Box plots to visualize outliers and IQR
Q-Q plots to assess normality alignment
A summary table with skewness, outlier ratio, and categorization
Below are the histogram, KDE plots, box plots and Q-Q plots of the features.
Below is the summary table of this outlier analysis with skewness, outlier ratio, and categorization for the features.
| Feature | Skewness | Outlier % | Distribution |
|---|---|---|---|
| return | -0.821 | 4.120 | Skewed/Long-tailed |
| volume_fi | -3.316 | 10.760 | Skewed/Long-tailed |
| volume_em | -0.897 | 5.730 | Skewed/Long-tailed |
| volatility_kcp | -0.441 | 2.060 | Normal |
| trend_macd | -0.233 | 0.310 | Normal |
| trend_macd_signal | -0.222 | 0.080 | Normal |
| trend_macd_diff | -0.270 | 0.990 | Normal |
| trend_trix | -0.281 | 1.150 | Normal |
| trend_dpo | 0.393 | 2.370 | Normal |
| trend_kst | -0.057 | 1.910 | Normal |
| trend_kst_sig | -0.050 | 1.760 | Normal |
| trend_kst_diff | 0.140 | 0.310 | Normal |
| trend_sma_fast_div_close | 0.976 | 3.590 | Skewed/Long-tailed |
| trend_sma_slow_div_close | 0.779 | 1.070 | Skewed/Long-tailed |
| trend_ema_fast_div_close | 0.958 | 2.900 | Skewed/Long-tailed |
| trend_ema_slow_div_close | 0.852 | 1.530 | Skewed/Long-tailed |
| momentum_ao | -0.280 | 0.920 | Normal |
| momentum_roc | -0.141 | 1.530 | Normal |
| momentum_ppo_hist | -0.115 | 1.220 | Normal |
| momentum_pvo_hist | 1.106 | 3.510 | Skewed/Long-tailed |
| others_dlr | -1.141 | 4.050 | Skewed/Long-tailed |
| o-c | 0.240 | 1.910 | Normal |
| qqq_sma_10_div_close | 0.764 | 2.140 | Skewed/Long-tailed |
| qqq_ema_10_div_close | 0.827 | 2.520 | Skewed/Long-tailed |
| qqq_o-c | -0.720 | 3.360 | Skewed/Long-tailed |
Based on this outlier analysis, we now have finalized categorization for all the 105 features:
Bounded distribution:
dsin, dcos, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_aroon_up, trend_aroon_down,
trend_aroon_ind, trend_psar_up_indicator, trend_psar_down_indicator, momentum_rsi, momentum_stoch_rsi,
momentum_stoch_rsi_k, momentum_stoch_rsi_d, momentum_tsi, momentum_uo, momentum_stoch,
momentum_stoch_signal, momentum_wr, psar_trend,
news_sentiment_score, news_emotion_intensity, qqq_rsi, FEDFUNDS
Skewed / long-tailed / clustered distribution:
volume, volume_adi, volume_obv, volume_cmf, volume_sma_em, volume_vpt,
volume_vwap, volume_mfi, volume_nvi, volatility_bbm, volatility_bbh, volatility_bbl,
volatility_bbw, volatility_bbp, volatility_bbhi, volatility_bbli, volatility_kcc, volatility_kch,
volatility_kcl, volatility_kcw, volatility_kchi, volatility_kcli, volatility_dcl,
volatility_dch, volatility_dcm, volatility_dcw, volatility_dcp, volatility_atr, volatility_ui,
trend_vortex_ind_pos, trend_vortex_ind_neg, trend_vortex_ind_diff, trend_mass_index, trend_cci,
trend_ichimoku_conv_div_close, trend_ichimoku_base_div_close, trend_ichimoku_a_div_close,
trend_ichimoku_b_div_close, trend_visual_ichimoku_a_div_close, trend_visual_ichimoku_b_div_close,
momentum_ppo, momentum_ppo_signal, momentum_pvo, momentum_pvo_signal, momentum_kama, others_cr, h-l, qqq_volume, qqq_adobe_corr_20, qqq_atr, qqq_bbands_l, qqq_bbands_m, qqq_bbands_u, qqq_macd, qqq_h-l, return, volume_fi, volume_em, trend_sma_fast_div_close, trend_sma_slow_div_close, trend_ema_fast_div_close, trend_ema_slow_div_close, momentum_pvo_hist, others_dlr, qqq_sma_10_div_close, qqq_ema_10_div_close, qqq_o-c
Normal distribution:
volatility_kcp, trend_macd, trend_macd_signal, trend_macd_diff, trend_trix, trend_dpo, trend_kst, trend_kst_sig, trend_kst_diff, momentum_ao, momentum_roc, momentum_ppo_hist, o-c
3.1.3.3 Scale the features¶
We’ve completed the structural evaluation of all features by analyzing their distributions and categorizing them into three groups based on scaling suitability.
To close this section, we will scale the features according to the defined categories to ensure consistency in future observation, analysis and model training.
Features with normal distribution will be scaled with StandardScaler.
Features with bounded distribution will be scaled with MinMaxScaler.
Features with skewed / long-tailed / clustered distribution will be scaled with RobustScaler.
The scaled features are stored in a new dataset X_scaled. It has the same structure (105 columns and 1310 rows) as X, except all its values are properly scaled.
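The per-group scaling can be sketched with scikit-learn, using one synthetic stand-in column per group (the real lists contain 25 bounded, 13 normal and 67 skewed columns):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One synthetic stand-in column per distribution group.
rng = np.random.default_rng(1)
X = pd.DataFrame(
    {
        "momentum_rsi": rng.uniform(0, 100, 200),  # bounded
        "trend_macd": rng.normal(0, 2, 200),       # ~normal
        "volume": rng.lognormal(14, 1, 200),       # skewed / long-tailed
    }
)

# Pair each scaler with its column group, then fit-transform group-wise.
groups = [
    (MinMaxScaler(), ["momentum_rsi"]),
    (StandardScaler(), ["trend_macd"]),
    (RobustScaler(), ["volume"]),
]

X_scaled = X.copy()
for scaler, cols in groups:
    X_scaled[cols] = scaler.fit_transform(X[cols])
print(X_scaled.describe().round(2))
```

After scaling, the bounded column lies in [0, 1], the normal column has mean 0 and unit variance, and the skewed column is centered on its median and scaled by its IQR, which keeps outliers from dominating.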
3.2 SHAP analysis and feature relationship exploration¶
The goal of this section is to uncover relationships among features, providing insights into potential multicollinearity and dependencies. Given the large number of features (105), it is impractical to study the bilateral relationships for every possible pair.
To address this efficiently, I will first apply SHAP analysis to identify the 10 most impactful features for the prediction task. These top features will then be explored with pairwise scatter plots. The least relevant features will also be detected and then removed in this process.
3.2.1 SHAP analysis¶
SHAP analysis is a powerful method for interpreting machine learning models by quantifying the impact of each feature on a prediction. In this study, we apply SHAP to an XGBoost classifier, evaluating the influence of features in X_scaled on the target variable y.
XGBoost demonstrated strong performance during the evaluation of my Exam 3 project, which justifies its selection as the model for SHAP-based interpretability in this analysis.
Below is the visualization of the 20 most influential features from the SHAP analysis.
The SHAP summary plot above ranks the top 20 features by their impact on the XGBoost classifier’s predictions, sorted from highest to lowest importance.
Each dot corresponds to a single day's observation. Its horizontal position reflects the feature’s influence on the model output—pushing the prediction toward 1 (right) or toward 0 (left). For example:
- A red dot for return positioned on the left indicates that a higher daily return pushes the prediction toward class 0.
- A blue dot for o-c located on the right indicates that a lower open-close price difference increases the probability of class 1.
Then, we calculate the mean absolute SHAP value of each feature as an indicator of its importance.
A mean absolute SHAP value above $0.1$ typically indicates a feature with strong predictive impact. Among our 105 features, 41 meet this standard.
Conversely, features with mean absolute SHAP values below $0.01$ tend to contribute negligibly to the model and may represent statistical noise or redundancy. In our case, 11 features fall below this threshold.
The result indicates good feature construction: approximately 40% of features exhibit strong influence, contributing meaningfully to model decisions, while only 10% show minimal impact, suggesting limited predictive value.
Next, we will:
- Extract the top 10 most impactful features for deeper relationship analysis.
- Remove the 11 features with mean absolute SHAP values below 0.01, as they contribute minimally to model performance.
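The ranking and filtering steps can be sketched with plain numpy, assuming `shap_values` is the (samples × features) matrix returned by a SHAP explainer; the small array below is a toy stand-in for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a SHAP value matrix: rows = days, columns = features
feature_names = ["return", "o-c", "volatility_bbhi", "volume_adi"]
shap_values = np.array([[0.5, -0.3, 0.0, 0.2],
                        [-0.4, 0.25, 0.0, -0.3]])

# Mean absolute SHAP value per feature = importance indicator
importance = pd.Series(np.abs(shap_values).mean(axis=0),
                       index=feature_names).sort_values(ascending=False)

top_features = importance.head(10).index.tolist()       # keep for pairwise study
to_drop = importance[importance < 0.01].index.tolist()  # negligible impact
```

In the project the same two lists drive the pairplot exploration and the removal of the 11 near-zero features from X_scaled.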
Below are the top 10 features and their mean absolute SHAP values. We will explore their relationships in the next section.
| Feature | Mean Abs SHAP |
|---|---|
| return | 0.458 |
| o-c | 0.280 |
| volume_adi | 0.248 |
| qqq_h-l | 0.244 |
| trend_vortex_ind_diff | 0.240 |
| trend_dpo | 0.233 |
| volatility_atr | 0.220 |
| h-l | 0.209 |
| volume_cmf | 0.194 |
| trend_vortex_ind_pos | 0.185 |
Below are the 11 features whose mean absolute SHAP values are under $0.01$. They will be removed from the feature dataset X_scaled.
| Feature | Mean Abs SHAP |
|---|---|
| volatility_dcl | 0.008 |
| volatility_bbhi | 0.000 |
| others_dlr | 0.000 |
| volatility_dcm | 0.000 |
| volatility_kchi | 0.000 |
| volatility_kcli | 0.000 |
| trend_psar_down_indicator | 0.000 |
| volatility_bbli | 0.000 |
| psar_trend | 0.000 |
| trend_psar_up_indicator | 0.000 |
| momentum_wr | 0.000 |
After the removal, we now have 94 features in X_scaled.
3.2.2 Explore relationships among high-impact features¶
To uncover potential dependencies and interaction patterns, I use multi-scatter plots (pairplots) on the top 10 most impactful features identified via SHAP analysis.
As illustrated in the visualizations:
- Many plots show feature values clustered along vertical lines, suggesting weak correlation between those pairs.
- trend_vortex_ind_diff and trend_vortex_ind_pos exhibit a strong positive correlation, forming a pronounced diagonal across their scatter plot.
- Moderate positive correlations are observed between: trend_vortex_ind_pos and volume_cmf; trend_vortex_ind_diff and volume_cmf; h-l and volatility_atr.
- Moderate negative correlations appear between: o-c and return; trend_vortex_ind_diff and trend_dpo.
- Most feature comparisons involving volume_adi produce two distinct clusters, indicating potential segmentation or bifurcation in behavior.
- The relationship between o-c and h-l is context-dependent: when o-c is negative, the correlation is negative; when o-c is positive, it appears positive.
These insights help pinpoint underlying dependencies and potential non-linear interactions within the most impactful features. I will document these findings for reference but hold off on taking any immediate action. These observations will be revisited during the multicollinearity analysis phase to inform further feature removal.
3.3 Analyze multi-collinearity and reduce dimensionality¶
3.3.1 VIF analysis and correlation heatmap¶
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with one another, and it can be detected using Variance Inflation Factors (VIF).
The VIF score of an independent variable measures how well that variable is explained by the other independent variables, and it is calculated by the following formula: $$ VIF = \frac{1}{1 - R^2} $$
Here $R^2$ is obtained by regressing the variable on all the other independent variables; a high $R^2$ means the variable is highly correlated with the rest.
VIF starts at $1$ (no correlation) and has no upper limit. In this case, we will remove highly correlated features, aiming to keep the remaining ones with VIF below $10$.
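The VIF computation can be sketched directly from this formula, regressing each feature on all the others. This is a minimal numpy version for illustration; `statsmodels.stats.outliers_influence.variance_inflation_factor` offers an equivalent off-the-shelf implementation.

```python
import numpy as np

def vif_scores(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i of X on all remaining columns (with an intercept)."""
    n, p = X.shape
    vifs = []
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])      # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2))
    return np.array(vifs)

# Toy data: column 3 is an almost exact copy of column 1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x1, x2, x1 + 0.01 * rng.normal(size=200)])
print(vif_scores(X))  # columns 1 and 3 get very large VIFs, column 2 stays near 1
```

The infinite VIFs seen in the table below arise exactly this way: when a feature is a perfect linear combination of others, $R^2 = 1$ and the formula diverges.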
We compute the VIF of the current 94 scaled features; the 50 features with the highest VIF scores are listed below:
| Features | VIF Score |
|---|---|
| trend_aroon_down | inf |
| volatility_kch | inf |
| trend_ichimoku_base_div_close | inf |
| trend_ichimoku_conv_div_close | inf |
| trend_macd | inf |
| momentum_ppo_signal | inf |
| trend_macd_diff | inf |
| trend_vortex_ind_pos | inf |
| trend_vortex_ind_neg | inf |
| trend_vortex_ind_diff | inf |
| trend_kst | inf |
| trend_kst_sig | inf |
| trend_kst_diff | inf |
| trend_aroon_up | inf |
| momentum_pvo_hist | inf |
| trend_aroon_ind | inf |
| momentum_pvo_signal | inf |
| momentum_pvo | inf |
| momentum_ppo_hist | inf |
| volatility_kcl | inf |
| trend_macd_signal | inf |
| volatility_kcc | inf |
| volatility_bbl | inf |
| qqq_bbands_u | inf |
| qqq_bbands_m | inf |
| qqq_bbands_l | inf |
| trend_ichimoku_a_div_close | inf |
| volatility_bbm | inf |
| volatility_bbh | inf |
| momentum_ppo | inf |
| trend_trix | 9482.907 |
| trend_ema_slow_div_close | 8646.385 |
| others_cr | 5781.741 |
| trend_ema_fast_div_close | 3332.137 |
| volume_vwap | 2998.145 |
| volatility_dch | 1427.714 |
| trend_sma_slow_div_close | 960.215 |
| trend_sma_fast_div_close | 740.427 |
| momentum_kama | 615.726 |
| momentum_ao | 298.481 |
| momentum_rsi | 221.841 |
| volatility_bbp | 216.140 |
| trend_visual_ichimoku_a_div_close | 212.918 |
| momentum_tsi | 176.274 |
| qqq_ema_10_div_close | 130.436 |
| trend_cci | 110.998 |
| qqq_sma_10_div_close | 98.291 |
| volatility_dcp | 93.383 |
| volume_vpt | 88.051 |
| momentum_stoch_rsi_k | 83.440 |
We use a heatmap to visualize the correlations.
As shown in the table, a substantial portion of the feature set exhibits extremely high VIF scores, ranging from four-digit values to infinity, signaling deep and complex interdependencies.
This is confirmed by the dense correlation heatmap, which reveals a tightly woven matrix of relationships across features: visual evidence of severe multicollinearity.
Next, we will reduce dimensionality to mitigate the high multicollinearity among the features.
3.3.2 Reduce dimensionality¶
To address feature multicollinearity and enhance model robustness, we will try two dimensionality reduction techniques:
Cluster-based selection
Apply unsupervised machine learning (e.g., KMeans) to group features based on similarity.
Within each cluster, retain only the feature with the highest mean absolute SHAP value to represent the group.
Correlation-based filtering
Identify highly correlated feature pairs across the dataset.
For each pair, remove the feature with the lower mean absolute SHAP value to reduce redundancy.
PCA is avoided to preserve interpretability. Additionally, PCA only captures linear dependencies, while our exploratory scatter plots show nonlinear relationships, limiting PCA’s effectiveness in preserving feature structure.
3.3.2.1 Cluster-based selection¶
To apply K-Means clustering on the features, we first need to construct a "feature-of-features" dataset that captures the characteristics of each feature.
Since all features are numeric, we will use the standard .describe() method to summarize their statistical properties, effectively reflecting the quantitative nature of each feature.
There will be 94 rows in this dataset, with each row representing a feature. Below is a preview of the first 5 rows:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| volume | 1310.000 | 0.334 | 1.322 | -1.461 | -0.387 | 0.000 | 0.613 | 17.880 |
| return | 1310.000 | -0.028 | 0.985 | -7.339 | -0.484 | 0.000 | 0.516 | 6.260 |
| dsin | 1310.000 | 0.578 | 0.382 | 0.000 | 0.277 | 0.723 | 0.901 | 1.000 |
| dcos | 1310.000 | 0.367 | 0.363 | 0.000 | 0.000 | 0.445 | 0.445 | 1.000 |
| volume_adi | 1310.000 | -0.072 | 0.818 | -3.181 | -0.498 | 0.000 | 0.502 | 1.583 |
Next, we’ll apply the Elbow method to identify the optimal number of clusters that align with our objective. Specifically, we’ll group the 94 features into clusters ranging from 2 to 60 and examine how the relative inertia (the measure of within-cluster compactness) declines as the number of clusters increases.
The Elbow plot shows that relative inertia begins to plateau beyond 20 clusters, indicating that KMeans is approaching its limit in terms of meaningful compression. This suggests minimal gain from further increasing the number of clusters.
However, I am concerned that selecting only 20 clusters from 94 may risk discarding many potentially meaningful ones. To balance compression with feature diversity, I choose to retain 30 clusters instead.
Therefore, we group the features to 30 clusters and select a feature with the highest mean absolute SHAP value as a representative in each cluster.
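The cluster-then-pick step can be sketched as below with sklearn's KMeans on the describe()-style summary table. The `feature_stats` and `mean_abs_shap` objects are toy stand-ins (the real run uses the 94-row summary and 30 clusters chosen via the Elbow method).

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-ins: statistical summary per feature, and mean |SHAP| importance
feature_stats = pd.DataFrame(
    {"mean": [0.0, 0.01, 5.0, 5.1], "std": [1.0, 1.1, 0.2, 0.25]},
    index=["return", "o-c", "dsin", "dcos"])
mean_abs_shap = pd.Series({"return": 0.458, "o-c": 0.280,
                           "dsin": 0.05, "dcos": 0.12})

k = 2  # 30 in the real run, per the Elbow analysis above
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(feature_stats)

# Within each cluster, keep the feature with the highest mean |SHAP|
representatives = mean_abs_shap.groupby(labels).idxmax().tolist()
```

The grouping is positional: `labels` is aligned with the row order of `feature_stats`, so the SHAP series must use the same feature order.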
After having the 30 representative features, let's review their VIF scores if they are kept as one feature set:
| Features | VIF Score |
|---|---|
| momentum_pvo | 17808.064 |
| momentum_pvo_signal | 11406.629 |
| momentum_pvo_hist | 4729.073 |
| trend_vortex_ind_diff | 40.739 |
| trend_vortex_ind_pos | 35.398 |
| trend_ichimoku_base_div_close | 13.097 |
| momentum_roc | 10.804 |
| volume_obv | 7.978 |
| volume_vwap | 7.640 |
| trend_adx_neg | 7.230 |
| trend_kst | 6.268 |
| volume_sma_em | 6.024 |
| volume_vpt | 5.005 |
| volume_fi | 4.513 |
| momentum_stoch_rsi | 4.509 |
| volume_adi | 4.097 |
| return | 4.018 |
| o-c | 3.601 |
| volume | 3.524 |
| qqq_volume | 3.296 |
| qqq_sma_10_div_close | 3.067 |
| volatility_atr | 3.021 |
| qqq_h-l | 2.712 |
| h-l | 2.620 |
| qqq_adobe_corr_20 | 2.334 |
| qqq_o-c | 2.318 |
| volume_em | 2.180 |
| volatility_bbw | 2.100 |
| trend_dpo | 2.074 |
| trend_mass_index | 1.948 |
As shown in the table above, while the cluster-based selection helps reduce multicollinearity by removing some features, it also comes with several limitations:
- Super-high multicollinearity remains: some features still exhibit VIF scores in the 5-digit range, which is highly alarming.
- Loss of feature diversity: key macroeconomic indicators like FEDFUNDS, sentiment-related features, and weekday-based features are excluded during selection. This reduction in feature variety is concerning.
- No clear direction for cluster adjustment: adjusting the number of clusters is not a promising fix, as increasing clusters may worsen multicollinearity, while decreasing them could further reduce feature diversity.
Given these concerns, I’ll pause further action and explore correlation-based feature selection instead.
3.3.2.2 Correlation-based selection¶
This approach is more straightforward: identify pairs of highly correlated features from the correlation matrix, then remove the one with the lower mean absolute SHAP value in each pair. This strategy balances redundancy reduction with preservation of predictive impact.
First, we set the threshold at $0.85$ and identify all feature pairs with correlation coefficients above it.
As a result, there are 201 feature pairs with correlation coefficients exceeding $0.85$, indicating a dense web of interrelationships. This high degree of multicollinearity suggests that removing even a single feature from a pair may lead to a notable reduction in VIF scores across multiple features.
Then, we examine each highly correlated pair and mark the feature with the lower mean absolute SHAP value as a candidate for removal. Based on this process, 49 features are suggested for removal, leaving 45 features.
Now, we reassess the 45 remaining features to evaluate whether multicollinearity is sufficiently mitigated, and observe how the VIF scores change.
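The pair-filtering logic can be sketched as follows, using the absolute correlation against the threshold and the drop rule described above. `X` and `shap_importance` are toy stand-ins for the 94-feature dataset and the SHAP importances.

```python
import numpy as np
import pandas as pd

def correlation_filter(X, shap_importance, threshold=0.85):
    """For each feature pair with |corr| above the threshold, drop the
    member with the lower mean absolute SHAP value."""
    corr = X.corr().abs()
    # Upper triangle only, so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    to_drop = set()
    for i, j in zip(*np.where(mask & (corr.values > threshold))):
        a, b = corr.index[i], corr.columns[j]
        to_drop.add(a if shap_importance[a] < shap_importance[b] else b)
    return [c for c in X.columns if c not in to_drop]

rng = np.random.default_rng(1)
base = rng.normal(size=300)
X = pd.DataFrame({
    "return": base,
    "o-c": -base + 0.05 * rng.normal(size=300),  # near-perfect correlation
    "volume": rng.normal(size=300),               # independent
})
shap_importance = {"return": 0.458, "o-c": 0.280, "volume": 0.100}
kept = correlation_filter(X, shap_importance)
print(kept)  # ['return', 'volume']
```

Because o-c carries the lower mean absolute SHAP value of the correlated pair, it is the one removed, which mirrors the selection applied to the 201 real pairs.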
| Features | VIF Score |
|---|---|
| momentum_uo | 41.073 |
| volume_vpt | 41.071 |
| trend_adx_neg | 40.871 |
| trend_adx_pos | 35.283 |
| qqq_rsi | 34.294 |
| FEDFUNDS | 34.048 |
| volume_vwap | 20.591 |
| momentum_stoch_rsi | 18.633 |
| qqq_bbands_u | 16.832 |
| volatility_ui | 15.668 |
| trend_kst | 13.643 |
| momentum_roc | 13.542 |
| volume_obv | 11.244 |
| trend_ichimoku_conv_div_close | 11.164 |
| trend_stc | 10.569 |
| trend_vortex_ind_diff | 10.361 |
| volatility_atr | 10.116 |
| volatility_kcw | 9.995 |
| trend_cci | 9.498 |
| trend_aroon_up | 9.246 |
| qqq_macd | 8.733 |
| trend_kst_diff | 8.149 |
| trend_adx | 7.869 |
| qqq_atr | 7.281 |
| volume_fi | 6.624 |
| qqq_sma_10_div_close | 5.396 |
| momentum_pvo | 5.268 |
| volume_adi | 4.883 |
| volatility_dcw | 4.660 |
| momentum_pvo_hist | 4.406 |
| return | 4.271 |
| volume | 4.241 |
| dsin | 3.910 |
| o-c | 3.894 |
| qqq_volume | 3.856 |
| qqq_h-l | 3.483 |
| volume_cmf | 3.347 |
| qqq_adobe_corr_20 | 3.192 |
| h-l | 2.920 |
| trend_mass_index | 2.715 |
| qqq_o-c | 2.518 |
| dcos | 2.342 |
| volume_em | 2.339 |
| trend_dpo | 2.207 |
| news_emotion_intensity | 2.141 |
This result is more acceptable than that of the cluster-based selection:
Multicollinearity is significantly reduced: the highest VIF score is now around 40. While VIF scores between 10 and 40 still suggest moderate collinearity, they are unlikely to substantially affect the model we will build, as LSTM neural networks are less sensitive to multicollinearity than linear models.
Feature diversity is preserved, including:
- news_emotion_intensity for sentiment analysis
- dcos for weekday patterns
- FEDFUNDS for macroeconomic signals
Since the correlation-based filtering performs well, I will retain this refined feature set for modeling.
Check the correlation heatmap of updated features.
The successful reduction of collinearity is also evident in the more desaturated heatmap.
This concludes the feature selection process. Below is the final list of 45 features we'll use to move forward with the project:
- Original OHLCV data: volume
- Return: return
- Trigonometric weekday features: dsin, dcos
- Volume features: volume_adi, volume_obv, volume_cmf, volume_fi, volume_em, volume_vpt, volume_vwap
- Volatility features: volatility_kcw, volatility_dcw, volatility_atr, volatility_ui
- Trend features: trend_vortex_ind_diff, trend_mass_index, trend_dpo, trend_kst, trend_kst_diff, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_cci, trend_aroon_up
- Transformed trend features (drift removed): trend_ichimoku_conv_div_close
- Momentum features: momentum_stoch_rsi, momentum_uo, momentum_roc, momentum_pvo, momentum_pvo_hist
- Daily variance features: h-l, o-c
- News sentiment features: news_emotion_intensity
- QQQ related features: qqq_volume, qqq_adobe_corr_20, qqq_sma_10_div_close, qqq_atr, qqq_bbands_u, qqq_rsi, qqq_macd, qqq_h-l, qqq_o-c
- Federal funds rate: FEDFUNDS
Next, we proceed to model building with our well-prepared feature set.
4. Model Building¶
Model building is the core of this project. In this chapter, we will explore several LSTM architectures and fine-tune them to identify the optimal structure and hyperparameter configuration for predicting Adobe's next-day upward movement.
Below are some key considerations for this model building exercise:
Architecture exploration
We concentrate on LSTM models with 2 or 3 layers, with or without dropout:
Baseline model: 2-layer LSTM without dropout
Additional variants:
A: 2-layer LSTM with dropout
B: 3-layer LSTM without dropout
C: 3-layer LSTM with dropout
All models include a final Dense output layer with sigmoid activation for binary classification.
Models are evaluated using the metrics of AUC, F1 score, accuracy, recall and precision.
Due to the limited sample size (n = 1310), architectures with 4 or more LSTM layers are excluded to avoid overfitting.
Hyperparameter optimization
For each candidate architecture, we will use Bayesian Optimization to tune:
Number of LSTM units per layer: $5$ to $25$, in steps of $5$
Dropout rates (where applicable): $[0.3, 0.4, 0.5, 0.6]$
Learning rate: $[0.0005, 0.001, 0.002]$
Activation functions: $[\text{'elu'}, \text{'relu'}]$
Other key hyperparameter design choices:
Optimization objective: maximize validation-set accuracy (val_accuracy). Maximizing validation AUC can often lead to zero class 1 predictions in this project due to the balanced targets.
Training epochs: $200$
Early stopping patience: $20$
Optimizer: Adam
Reproducibility considerations:
Due to TensorFlow’s non‑determinism (LSTM kernels, GPU parallelism) and the stochastic nature of Bayesian search, exact reproducibility of tuning results is not guaranteed.
To mitigate variance, each architecture’s tuning will be run three times, and the best result from those runs will be saved and represent that model.
Although individual runs vary, performance across runs for a given structure should fall within a consistent range, allowing meaningful comparisons.
Others:
Training progress is logged using TensorBoard for transparent tracking.
Among the 4 optimized models, the one with best evaluation result on testing set will be selected for backtesting in the next chapter.
4.1 Prepare dataset¶
4.1.1 Split and scale the dataset¶
To ensure a leak-free modeling pipeline, we begin by reverting to the unscaled version of the dataset. From there:
Restrict the dataset to the 45 features finalized in previous EDA phases.
Split this unscaled data into training ($70\%$), validation ($15\%$) and testing sets ($15\%$), preserving chronological order for time-series integrity.
Fit scaling transformations (StandardScaler, MinMaxScaler, RobustScaler for respective features as appropriate) only on the training set to prevent future data (validation and testing sets) from influencing learned parameters.
Apply the fitted scalers to the validation and testing sets to maintain consistency.
This approach preserves temporal structure and statistical independence across splits, laying a clean foundation for reliable LSTM training.
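The leak-free split-then-scale pipeline above can be sketched as follows (a chronological split; a single StandardScaler stands in here for the three per-group scalers):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def chronological_split_scale(X, train_frac=0.70, val_frac=0.15):
    """Split chronologically, fit the scaler on the training slice only,
    then apply it unchanged to validation and test."""
    n = len(X)
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    X_train = X.iloc[:n_train]
    X_val = X.iloc[n_train:n_train + n_val]
    X_test = X.iloc[n_train + n_val:]
    scaler = StandardScaler().fit(X_train)          # no look-ahead
    return (scaler.transform(X_train),
            scaler.transform(X_val),
            scaler.transform(X_test))

X = pd.DataFrame({"f": np.arange(1310, dtype=float)})  # toy 1310-row frame
tr, va, te = chronological_split_scale(X)
print(tr.shape, va.shape, te.shape)  # (917, 1) (196, 1) (197, 1)
```

Note that the validation and test slices are transformed with the training-set statistics, so a trending feature (as in this toy example) will sit far from zero there; that is expected and leak-free.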
We now have the following datasets prepared for training and evaluation:
Training set: X_train_scaled ($917$ samples, $45$ features), y_train ($917$ target labels)
Validation set: X_val_scaled ($196$ samples, $45$ features), y_val ($196$ target labels)
Testing set: X_test_scaled ($197$ samples, $45$ features), y_test ($197$ target labels)
The datasets are now fully prepared for LSTM model building.
4.1.2 Create data generator¶
When building an LSTM model, a critical design choice is selecting an appropriate sequence length, i.e., the number of consecutive past days used as input for predicting the following day.
For a technology stock like Adobe, a span of approximately one month ($20–22$ trading days) offers a practical balance:
Captures medium-term momentum and volatility trends
Preserves recent signal strength
Avoids overextending input length, which could dilute temporal relevance and increase model complexity
Based on this reasoning, we set the sequence length to $21$ days.
Meanwhile, we use TimeseriesGenerator from Keras to efficiently generate batches of sequential data for time-series modeling.
A batch refers to a group of input sequences of features and their corresponding target values processed together during training or evaluation. Batching improves computational efficiency and allows for smoother gradient updates during optimization.
Here, the batch size is set to $32$, balancing training stability and computational efficiency given the moderate dataset size. During generator creation, any remaining samples that fall short of a full batch are discarded.
Following this discussion, we will have three generators:
g_train: $917 // 32 = 28$ batches
g_val: $196 // 32 = 6$ batches
g_test: $197 // 32 = 6$ batches
Each batch contains:
$32$ (batch size) feature sequences, each representing the most recent $21$ (sequence length) trading days
Each day includes $45$ features, resulting in a feature batch shape of $(32, 21, 45)$
$32$ corresponding target values, each indicating whether the stock price moved up on the day following the 21-day window
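A minimal numpy equivalent of the batching just described is useful for verifying the shapes (this is an illustrative sketch, not the exact Keras TimeseriesGenerator implementation):

```python
import numpy as np

def make_batches(X, y, seq_len=21, batch_size=32):
    """Yield (features, targets) batches: each sample is a window of
    seq_len consecutive days, paired with the target of the next day."""
    n_seq = len(X) - seq_len
    windows = np.stack([X[i:i + seq_len] for i in range(n_seq)])
    targets = y[seq_len:]                      # day following each window
    # Drop the final partial batch, as in the generator setup above
    for start in range(0, n_seq - batch_size + 1, batch_size):
        yield (windows[start:start + batch_size],
               targets[start:start + batch_size])

X = np.random.rand(917, 45)          # stand-in for the training features
y = np.random.randint(0, 2, 917)     # stand-in for the binary targets
batches = list(make_batches(X, y))
print(len(batches), batches[0][0].shape)  # 28 (32, 21, 45)
```

Note that windowing consumes the first 21 days, so the number of usable sequences is slightly smaller than the raw sample count.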
4.2 Baseline model - 2 layer LSTM model without dropout¶
4.2.1 Build the model (2 layers without dropout)¶
After structuring the architecture and tuning the hyperparameters within the planned search space, the optimized model has the following specifications:
Layer 1 units : $20$
Layer 2 units : $5$
Learning rate : $0.002$
Activation (layer 1) : relu
Activation (layer 2) : relu
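In Keras, the specification above corresponds to a model along these lines (a sketch assuming TensorFlow 2.x, not the exact tuner-produced object):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    tf.keras.Input(shape=(21, 45)),                  # 21-day window, 45 features
    layers.LSTM(20, activation="relu",
                return_sequences=True),              # layer 1: 20 units
    layers.LSTM(5, activation="relu"),               # layer 2: 5 units
    layers.Dense(1, activation="sigmoid"),           # P(up move tomorrow)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
              loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
```

`return_sequences=True` on the first layer passes the full 21-step hidden sequence to the second LSTM, while the second layer emits only its final state to the sigmoid head.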
The illustration below depicts the finalized model structure.
Screenshots of key visuals on TensorBoard results (TIME SERIES tab):
Interpretation:
In the epoch_accuracy plot, most hyperparameter trials showed increasing validation accuracy, roughly converging into 3 clusters $[0.5, 0.56, 0.63]$ by epoch 25. The best-performing run achieved a validation accuracy of ~$0.68$, indicating solid learning progress and effective hyperparameter tuning.
The epoch_auc plot shows that most model configurations gradually improved their validation AUC over training epochs. Although the AUCs of some trials started around or below $0.5$ (random guessing), many reached above $0.65$, and the best delivered $0.74$, indicating good discriminative power accumulated during training.
The epoch_loss panel shows the binary cross-entropy loss over training epochs for different hyperparameter configurations. Approximately half of the trials showed steadily decreasing loss values, indicating effective learning; the other trials exhibited oscillating loss curves and triggered early stopping quickly, suggesting less effective hyperparameter settings.
Generally speaking, most trials finished learning around 30 epochs with improved accuracy and AUC and decreased loss. The optimized model from this training run should demonstrate value in the later evaluation.
4.2.2 Evaluate the model (2 layers without dropout)¶
4.2.2.1 Evaluate the training data against the testing data¶
The trained model was evaluated on both the training and testing datasets using g_train and g_test, respectively. The observed accuracies are:
- Training accuracy: $0.63$
- Testing accuracy: $0.55$
4.2.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis¶
The model's performance on the testing dataset g_test is detailed below, including ROC Curve, Confusion Matrix and Classification Report:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.56 | 0.70 | 0.62 | 92 |
| 1 | 0.54 | 0.39 | 0.46 | 84 |
| Accuracy | | | 0.55 | 176 |
| Macro Avg | 0.55 | 0.54 | 0.54 | 176 |
| Weighted Avg | 0.55 | 0.55 | 0.54 | 176 |
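The evaluation artifacts above can be produced from the model's probability outputs with sklearn. The labels and probabilities below are toy stand-ins; in the project they come from y_test and the model's predictions on g_test.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, confusion_matrix,
                             classification_report)

# Toy stand-ins for true labels and predicted up-move probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.2, 0.6, 0.7, 0.4, 0.3, 0.8, 0.45, 0.55])
y_pred = (y_prob >= 0.5).astype(int)   # threshold probabilities at 0.5

auc = roc_auc_score(y_true, y_prob)    # ranking quality, threshold-free
cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
print(classification_report(y_true, y_pred, digits=2))
```

Note that AUC is computed from the raw probabilities, while precision, recall, F1 and the confusion matrix depend on the chosen 0.5 threshold.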
4.2.2.3 Baseline model evaluation summary¶
Training Metrics
- Accuracy: $0.63$
Testing Metrics
Accuracy: $0.55$
AUC: $0.52$
Class 1 precision: $0.54$
Class 1 recall: $0.39$
Class 1 F1 score: $0.46$
Interpretation
Moderate learning without extreme overfitting: Training accuracy of $0.63$ suggests the model learned moderately, while the testing accuracy of $0.55$ indicates limited generalization with some room for performance gains.
Limited discriminative power: AUC of $0.52$ shows near-random separation capability.
Weak signal capture: precision ($0.54$), recall ($0.39$), and F1 score ($0.46$) suggest the model struggles to identify upward trends.
It’s an acceptable starting point for iterative refinement. In our further efforts, an additional layer may improve signal capture.
4.3 Variant model A - 2 layer LSTM model with dropout¶
4.3.1 Build the model (2 layers with dropout)¶
After structuring the architecture and tuning the hyperparameters within the planned search space, the optimized model has the following specifications:
Layer 1 units : $5$
Layer 2 units : $5$
Dropout rate after layer 1 : $0.5$
Learning rate : $0.002$
Activation (layer 1) : relu
Activation (layer 2) : relu
The illustration below depicts the finalized model structure.
Screenshots of key visuals on TensorBoard results (TIME SERIES tab):
Interpretation:
In the epoch_accuracy plot, most hyperparameter trials showed increasing validation accuracy, and more trials had a clear upward tendency than in the baseline model. However, the best-performing run achieved a validation accuracy of ~$0.66$, below the baseline model's best. This suggests that adding dropout improves the chance of a better run, but does not necessarily raise the upper limit of accuracy in this project.
The epoch_auc plot shows that most model configurations gradually improved their validation AUC over training epochs. As with accuracy, more trials in this model improved over the epochs than in the baseline, yet the best AUC of $0.72$ did not exceed the baseline model's record.
The epoch_loss plot again shows two groups: half of the trials had steadily decreasing loss values, indicating effective learning, while the others exhibited oscillating loss curves. Compared to the baseline model, where all trials finished within around $30$ epochs, many trials here with decreasing loss took more epochs (~$60$) to reach early stopping.
Generally, this model with dropout requires more training epochs before meeting the early stopping criteria based on loss reduction. While most trials outperformed those from the baseline model, the top-performing trial did not surpass the strongest baseline counterpart.
4.3.2 Evaluate the model (2 layers with dropout)¶
4.3.2.1 Evaluate the training data against the testing data¶
The trained model was evaluated on both the training and testing datasets using g_train and g_test, respectively. The observed accuracies are:
- Training accuracy: $0.58$
- Testing accuracy: $0.51$
4.3.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis¶
The model's performance on the testing dataset g_test is detailed below, including ROC Curve, Confusion Matrix and Classification Report:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.55 | 0.37 | 0.44 | 92 |
| 1 | 0.49 | 0.67 | 0.57 | 84 |
| Accuracy | | | 0.51 | 176 |
| Macro Avg | 0.52 | 0.52 | 0.50 | 176 |
| Weighted Avg | 0.52 | 0.51 | 0.50 | 176 |
4.3.2.3 Model evaluation summary¶
Training Metrics
- Accuracy: $0.58$
Testing Metrics
Accuracy: $0.51$
Class 1 precision: $0.49$
Class 1 recall: $0.67$
Class 1 F1 score: $0.57$
AUC: $0.51$
Interpretation
Moderate learning without extreme overfitting: Training accuracy of $0.58$ suggests the model learned moderately, while the testing accuracy of $0.51$ indicates limited generalization with room for performance gains.
Aggressive positive detection: High recall for class 1 ($0.67$) paired with moderate precision ($0.49$) shows the model favors identifying positives, possibly at the cost of false alarms.
Limited discriminative power: AUC of $0.51$ shows the model still doesn't perform much better than random guessing.
While signal capture improved over the baseline model, it came partly at the cost of class 1 precision. Next, we will explore whether an additional layer can improve the overall performance.
4.4 Variant model B - 3 layer LSTM model without dropout¶
4.4.1 Build the model (3 layers without dropout)¶
After structuring the architecture and tuning the hyperparameters within the planned search space, the optimized model has the following specifications:
Layer 1 units : $15$
Layer 2 units : $25$
Layer 3 units : $5$
Learning rate : $0.002$
Activation (layer 1): elu
Activation (layer 2): elu
Activation (layer 3): relu
The illustration below depicts the finalized model structure.
Screenshots of key visuals on TensorBoard results (TIME SERIES tab):
Interpretation:
In the epoch_accuracy plot, most hyperparameter trials showed increasing validation accuracy, following patterns similar to the baseline model but requiring more epochs. The best-performing run also achieved a validation accuracy of ~$0.68$, matching the baseline model's best result.
The epoch_auc plot shows that most model configurations gradually improved their validation AUC over training epochs. Although the AUCs of some trials started around or below $0.5$ (random guessing), the best trial delivered an AUC of ~$0.75$, very close to the baseline model's.
The epoch_loss plot likewise closely resembles its baseline counterpart. Approximately half of the trials showed steadily decreasing loss values, indicating effective learning; the other trials exhibited oscillating loss curves, suggesting less effective hyperparameter settings.
Overall, adding an extra layer had limited impact on the training experience compared to the baseline model; the only notable change was that trials required more epochs to reach early stopping.
4.4.2 Evaluate the model (3 layers without dropout)¶
4.4.2.1 Evaluate the training data against the testing data¶
The trained model was evaluated on both the training and testing datasets using g_train and g_test, respectively. The observed accuracies are:
- Training accuracy: $0.60$
- Testing accuracy: $0.52$
4.4.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis¶
The model's performance on the testing dataset g_test is detailed below, including ROC Curve, Confusion Matrix and Classification Report:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.53 | 0.73 | 0.61 | 92 |
| 1 | 0.50 | 0.30 | 0.37 | 84 |
| Accuracy | | | 0.52 | 176 |
| Macro Avg | 0.52 | 0.51 | 0.49 | 176 |
| Weighted Avg | 0.52 | 0.52 | 0.50 | 176 |
4.4.2.3 Model evaluation summary¶
Training Metrics
- Accuracy: $0.60$
Testing Metrics
Accuracy: $0.52$
Class 1 precision: $0.50$
Class 1 recall: $0.30$
Class 1 F1 score: $0.37$
AUC: $0.52$
Interpretation
Moderate learning without extreme overfitting: Training accuracy of $0.60$ suggests the model learned moderately, while the testing accuracy of $0.52$ indicates limited predictive power.
Conservative class 1 prediction: moderate precision ($0.50$) with low recall ($0.30$) suggests the model predicts positives cautiously, missing many true positives.
Limited discriminative power: AUC of $0.52$ indicates the model’s ability to distinguish between classes still remains close to chance level.
Adding another layer does not improve the model's performance; even the 2-layer model with dropout demonstrates slightly more value.
4.5 Variant model C - 3 layer LSTM model with dropout¶
4.5.1 Build the model (3 layers with dropout)¶
After structuring the architecture and tuning the hyperparameters within the planned search space, the optimized model has the following specifications:
Layer 1 units : $15$
Layer 2 units : $10$
Layer 3 units : $15$
Dropout rate after layer 1 : $0.5$
Dropout rate after layer 2 : $0.5$
Learning rate : $0.002$
Activation (layer 1): elu
Activation (layer 2): relu
Activation (layer 3): relu
The illustration below depicts the finalized model structure.
Screenshots of key visuals on TensorBoard results (TIME SERIES tab):
Interpretation:
In the epoch_accuracy plot, most hyperparameter trials showed increasing validation accuracy, yet the starting accuracy was quite low across all trials, and the best accuracy of $0.64$ was the worst among all the models.
The epoch_auc plot shows that most model configurations gradually improved their validation AUC over training epochs. Although the improvement followed trends similar to the other models, the best value of $0.71$ was still the lowest.
The epoch_loss panel again reveals two distinct patterns: roughly half of the trials show consistently decreasing loss, suggesting effective learning, while the others display oscillating or upward-trending loss curves, an undesirable behavior not observed in the other models. The lowest recorded loss was $0.62$, again the worst among all the models. Most trials reached early stopping near epoch 47, with only two outliers extending to epoch 62.
With the extra layer and dropout, this model delivered worse results while training for more epochs, suggesting that for this project, adding neurons and regularization together is not a good combination for improving performance.
4.5.2 Evaluate the model (3 layers with dropout)¶
4.5.2.1 Evaluate the training data against the testing data¶
The trained model was evaluated on both the training and testing datasets using g_train and g_test, respectively. The observed accuracies are:
- Training accuracy: $0.53$
- Testing accuracy: $0.53$
4.5.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis¶
The model's performance on the testing dataset g_test is detailed below, including ROC Curve, Confusion Matrix and Classification Report:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.54 | 0.73 | 0.62 | 92 |
| 1 | 0.52 | 0.32 | 0.40 | 84 |
| Accuracy | | | 0.53 | 176 |
| Macro Avg | 0.53 | 0.52 | 0.51 | 176 |
| Weighted Avg | 0.53 | 0.53 | 0.51 | 176 |
4.5.2.3 Model evaluation summary¶
Training Metrics
- Accuracy: $0.53$
Testing Metrics
- Accuracy: $0.53$
- Class 1 precision: $0.52$
- Class 1 recall: $0.32$
- Class 1 F1 score: $0.40$
- AUC: $0.55$
Interpretation
Consistent learning without overfitting: Training and testing accuracy are tied at $0.53$, indicating no overfitting, but the drop in training accuracy suggests the model does not learn as much as the previous ones.
Moderate precision, low recall for class 1: The model predicts positives fairly accurately (precision: $0.52$) but misses many actual positives (recall: $0.32$), pulling the F1 score down to $0.40$.
Limited discriminative power: An AUC of $0.55$ shows some improvement in distinguishing between classes but is still underwhelming.
While dropout helps reduce overfitting and enhances discriminative power, the overall accuracy shows no improvement, making it difficult to consider this model the best.
4.6 Review of all the models¶
Let's collect all the performance data and put them together in a dataframe for a clear comparison.
| Model | training accuracy | testing accuracy | train/test accuracy gap | AUC | class 1 precision | class 1 recall | class 1 F1 score |
|---|---|---|---|---|---|---|---|
| 2 layers without dropout (baseline model) | 0.630 | 0.550 | 0.080 | 0.520 | 0.540 | 0.390 | 0.460 |
| 2 layers with dropout (variant model A) | 0.580 | 0.510 | 0.070 | 0.510 | 0.490 | 0.670 | 0.570 |
| 3 layers without dropout (variant model B) | 0.600 | 0.520 | 0.080 | 0.520 | 0.500 | 0.300 | 0.370 |
| 3 layers with dropout (variant model C) | 0.530 | 0.530 | 0.000 | 0.550 | 0.520 | 0.320 | 0.400 |
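The table above can be assembled with pandas. This sketch hard-codes the figures from the report for illustration; in the notebook they would be collected from each model's evaluation results, and the variable names here are hypothetical.

```python
# Build the model-comparison table as a DataFrame (values from the report).
import pandas as pd

metrics = ["training accuracy", "testing accuracy", "train/test accuracy gap",
           "AUC", "class 1 precision", "class 1 recall", "class 1 F1 score"]
rows = {
    "2 layers without dropout (baseline model)":  [0.63, 0.55, 0.08, 0.52, 0.54, 0.39, 0.46],
    "2 layers with dropout (variant model A)":    [0.58, 0.51, 0.07, 0.51, 0.49, 0.67, 0.57],
    "3 layers without dropout (variant model B)": [0.60, 0.52, 0.08, 0.52, 0.50, 0.30, 0.37],
    "3 layers with dropout (variant model C)":    [0.53, 0.53, 0.00, 0.55, 0.52, 0.32, 0.40],
}
comparison = pd.DataFrame.from_dict(rows, orient="index", columns=metrics)
```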
Baseline model (2 layers without dropout) has the highest training ($0.63$), testing accuracy ($0.55$) and precision ($0.54$), but suffers from overfitting (train/test gap: $0.08$) and low recall ($0.39$) on class 1 (uptrend).
Variant model A (2 layers with dropout) shows better class 1 recall ($0.67$) and the highest class 1 F1 score ($0.57$), indicating stronger detection of uptrends, despite modest accuracy and AUC. Its slight drop in accuracy is an acceptable trade-off for its predictive strength on the target class. It is the only model with a higher recall than precision, demonstrating the most aggressive prediction style among the 4 models.
Variant model B (3 layers without dropout) offers no significant accuracy or AUC improvement and has the lowest class 1 recall ($0.30$) and F1 score ($0.37$), suggesting it's less effective at identifying uptrends. Its train/test accuracy gap and AUC are the same as those of the baseline model, while precision and recall follow a similar pattern as well. It can be seen as a resembling but weaker version of the baseline model.
Variant model C (3 layers with dropout) achieves the lowest overfitting ($0.00$ gap) and highest AUC ($0.55$), indicating balanced generalization from dropout and better discriminative power, possibly contributed by the additional layer. But its recall ($0.32$) and F1 ($0.40$) remain low, indicating weak capability to catch upward movements. Its complexity didn't translate to substantially better classification metrics.
Dropout generally improves generalization: Comparing both pairs (2-layer vs. 2-layer with dropout and 3-layer vs. 3-layer with dropout), dropout reduced overfitting and improved recall for detecting uptrends.
Adding a third LSTM layer doesn't yield clear benefits: Increasing depth from 2 to 3 layers did not improve accuracy, AUC, or class 1 performance. In fact, the 3-layer models (with or without dropout) showed lower F1 scores ($0.37–0.40$) and poorer recall than the 2-layer with dropout model.
Recommended Model for Backtesting: Baseline model (2 layers without dropout)¶
Adding an extra layer negatively impacted model performance, so neither Variant Model B nor C will be considered the best candidate.
Then, I conducted backtesting for both the baseline model and variant model A, and the baseline model showed slightly better performance.
This is because Variant A achieved a relatively high F1 score ($0.57$), driven by strong recall ($0.67$) but limited precision ($0.49$), reflecting an aggressive prediction style that led to more false alarms. Given the high volatility and frequent downturns in Adobe’s share price between 2020 and 2025, these false alarms would introduce significant negative returns that offset the gains from correct predictions.
As a result, a more conservative strategy is better suited for backtesting such a volatile asset. The baseline model, with its higher precision and accuracy, is the preferred choice for final deployment.
5. Trading strategy with backtesting¶
In this chapter, we encapsulate the baseline model (2-layer without dropout) as a trading strategy to assess its performance.
Specifically, we reapply the model to the full Adobe dataset from 2020 to 2025 and evaluate its results using a range of analytical techniques.
5.1 Profit analysis¶
To assess the effectiveness of the baseline model, we will conduct a simple backtesting trading exercise with the 5-year Adobe price data. The trading rules are straightforward:
Buy Signal (Predicted = 1): Purchase one share of Adobe at the day's closing price.
Sell on the Next Day: If a purchase occurs, the position will be sold at the next day's closing price.
No Trade (Predicted = 0): No transaction is executed on this day related to this prediction.
No Transaction Costs or Friction Included: This backtest assumes a cost-free trading environment for simplicity.
Trades following these rules will be referred to as the LSTM Strategy in the subsequent analysis.
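The arithmetic behind these rules can be sketched with pandas. This is a minimal illustration, assuming `close` is a Series of daily closing prices and `pred` a same-length Series of 0/1 signals; both names and the helper `lstm_strategy_profit` are hypothetical.

```python
# Per-share cumulative profit of the one-day long strategy:
# buy at today's close when pred == 1, sell at the next day's close, no costs.
import pandas as pd

def lstm_strategy_profit(close: pd.Series, pred: pd.Series) -> pd.Series:
    next_day_move = close.shift(-1) - close             # sell price minus buy price
    daily_profit = next_day_move.where(pred == 1, 0.0)  # trade only on buy signals
    return daily_profit.cumsum()

# Toy example (not Adobe data): signals on days 0 and 2
close = pd.Series([100.0, 102.0, 101.0, 104.0])
pred = pd.Series([1, 0, 1, 0])
profit = lstm_strategy_profit(close, pred)
```

The buy-and-hold benchmark is simply `close - close.iloc[0]` over the same period.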
After a series of calculations, I derived the daily cumulative profit from trading using the LSTM strategy in the past 5 years. For comparison, I also computed the daily cumulative profit following a buy-and-hold approach.
The illustration below visualizes the performance of both strategies over time.
Observations from the profit analysis:
LSTM strategy generated $\$1153.52$ per share, while buy-and-hold yielded only $\$36.71$ — roughly a 30x improvement in profit.
Adobe stock was highly volatile with limited overall growth during the 5 years, leading to frequent losses for buy-and-hold investors.
LSTM strategy delivered smoother, more consistent gains with much lower exposure to volatility.
Major market drawdowns (e.g., 2022, early 2024, early 2025) had limited impact on LSTM, showing the model's strong downside protection.
Overall, the LSTM strategy not only vastly outperformed the buy-and-hold approach in absolute profit, but also provided a more stable and resilient path through a volatile market environment. Its ability to sidestep downturns while maintaining consistent growth highlights its strength as an active trading strategy.
5.2 Rolling Sharpe ratio analysis¶
The Sharpe ratio measures return relative to risk:
$$\text{Sharpe Ratio}=\frac{\text{Mean Return}}{\text{Standard Deviation of Return}}$$ A rolling Sharpe ratio calculates this over a moving time window (e.g. 126 trading days ≈ 6 months), so we can see how a strategy’s risk-adjusted performance evolves. It highlights the model’s consistency, stability, and adaptability across different market conditions.
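The rolling computation is a one-liner in pandas. This sketch applies the mean/std formula above over a 126-day window, assuming `returns` is a Series of daily strategy returns (a hypothetical name); annualization is omitted to match the definition in the text.

```python
# 126-day rolling Sharpe ratio: rolling mean divided by rolling std of returns.
import pandas as pd

def rolling_sharpe(returns: pd.Series, window: int = 126) -> pd.Series:
    roll = returns.rolling(window)
    return roll.mean() / roll.std()

# Toy example with a 2-day window (NaN until the window fills):
sharpe = rolling_sharpe(pd.Series([0.01, 0.03, -0.01]), window=2)
```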
In this analysis, we compute the 126-day rolling Sharpe ratio for both the LSTM-based strategy and the buy-and-hold benchmark over the past 5 years, and compare their trajectories as shown in the below illustration to assess relative performance.
Observations from the rolling Sharpe ratio comparison:
LSTM strategy maintains consistently higher Sharpe ratios than buy-and-hold across all periods.
Both strategies follow similar trends, indicating shared market exposure, but LSTM exhibits stronger resilience.
LSTM stays largely above zero, reflecting sustained positive risk-adjusted returns.
During major drawdowns (e.g., late 2021–mid 2022 and late 2023–mid 2024), the LSTM strategy shows less severe declines and quicker recoveries, indicating better downside control.
Overall, the visualization highlights the LSTM strategy's strong ability to navigate volatile market conditions while preserving return consistency. Its elevated and stable Sharpe ratios show effective signal learning and risk management.
5.3 Underwater curve analysis¶
An underwater curve is a visual representation of drawdowns over time — it shows how far an investment is below its previous peak.
$$\text{Underwater} = \frac{\text{Current Cumulative Return}}{\text{Historical Peak}} - 1$$ The values are always smaller than or equal to zero (zero means it's at a peak), and more negative means deeper drawdown.
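The formula above translates directly into pandas. This sketch assumes `returns` is a Series of daily returns (a hypothetical name): compound the returns into a cumulative path, track its running peak, and take the ratio minus one.

```python
# Underwater curve: cumulative return relative to its historical peak, minus 1.
import pandas as pd

def underwater(returns: pd.Series) -> pd.Series:
    cumulative = (1.0 + returns).cumprod()  # cumulative return path
    peak = cumulative.cummax()              # historical peak so far
    return cumulative / peak - 1.0          # <= 0; zero means at a new peak

# Toy example: up 10%, down 20%, up 5%
uw = underwater(pd.Series([0.10, -0.20, 0.05]))
```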
We can still use the daily return information from the LSTM strategy and the buy-and-hold approach to calculate their underwater values in the past 5 years and visualize them accordingly in the below illustration.
Observations from the underwater curve comparison:
LSTM strategy exhibits frequent recovery to new highs, keeping its drawdowns shallow and short-lived. Most underwater periods remain above $-20\%$, and recovery often occurs within months.
In contrast, the buy-and-hold approach suffers from long and deep drawdowns, including multi-year recovery periods and max drawdowns reaching $-60\%$.
The volatility of drawdowns in LSTM is much lower than that of buy-and-hold, indicating better risk control from the baseline model.
In recent years, especially from 2022 to 2025, LSTM’s ability to recover quickly after market downturns contrasts sharply with buy-and-hold’s persistent underwater state.
The underwater plot clearly shows that the LSTM strategy provides strong downside protection and quicker recovery compared to the buy-and-hold approach. While both strategies experience drawdowns during turbulent periods, LSTM’s drawdowns are more contained and typically followed by swift rebounds, indicating stronger resilience.
5.4 Pyfolio analysis¶
Pyfolio is a Python library for analyzing portfolio performance and risk management, making it especially valuable for evaluating backtested trading strategies. It provides a comprehensive set of metrics to assess a strategy’s effectiveness.
Below are the metrics provided by Pyfolio to evaluate the backtesting of the LSTM strategy and buy-and-hold approach.
| Metrics | LSTM strategy | Buy-and-hold approach |
|---|---|---|
| Start Date | 2020-05-14 | 2020-05-14 |
| End Date | 2025-07-01 | 2025-07-01 |
| Total Months | 61 | 61 |
| Annual Return | 62.4% | 1.8% |
| Cumulative Returns | 1094.1% | 9.4% |
| Annual Volatility | 26.1% | 36.0% |
| Sharpe Ratio | 1.99 | 0.23 |
| Calmar Ratio | 2.77 | 0.03 |
| Stability | 0.97 | 0.00 |
| Max Drawdown | -22.5% | -60.0% |
| Omega Ratio | 1.63 | 1.04 |
| Sortino Ratio | 3.04 | 0.31 |
| Skew | -0.36 | -0.83 |
| Kurtosis | 16.43 | 7.77 |
| Tail Ratio | 1.38 | 0.95 |
| Daily Value at Risk | -3.1% | -4.5% |
Observations from the metrics by Pyfolio:
LSTM strategy achieved $62.4\%$ annual return, much higher than $1.8\%$ from the buy-and-hold. The gap of cumulative returns ($1094.1\%$ vs. $9.4\%$) is also quite significant.
Sharpe ratio ($1.99$) and Sortino ratio ($3.04$) indicate strong risk-adjusted returns for LSTM; buy-and-hold lags far behind with Sharpe of $0.23$ and Sortino of $0.31$.
LSTM volatility is lower ($26.1\%$) compared to buy-and-hold ($36.0\%$), showing more stable performance despite higher returns.
Max drawdown is much smaller for LSTM ($-22.5\%$) than buy-and-hold ($-60\%$), signaling better downside protection. This was also discussed in the underwater curve analysis.
The LSTM strategy demonstrates not just impressive returns, but a good balance of profitability and risk management. It offers a significantly more attractive and robust approach for navigating volatile markets, making it a strong candidate for active trading.
6. Conclusion¶
In this project, I explored the application of LSTM-based deep learning models in predicting upward movements of Adobe stock, and designed a strategy that outperforms traditional approaches in both returns and risk control.
Through a comprehensive process — from sourcing and cleaning data, crafting features using diverse techniques, and performing exploratory analysis with multicollinearity reduction and scaling, to building and tuning baseline and variant models — the workflow reflects a rigorous approach to quantitative modeling.
Beyond the technical outcomes, this project was a valuable learning experience for me as well. It deepened my understanding of deep learning in finance, enhanced my modeling skills, and strengthened my command of Python programming and modern ML libraries. I’m grateful to the CQF program for equipping me with the skills to complete this project, and to my family for their unwavering support during this intensive learning phase.